City College of San Francisco


MATH 108 - Foundations of Data Science

Lecture 07: Charts¶

Associated Textbook Sections: 7.0, 7.1

Overview¶

  • Why Do We Visualize Data
  • Course Visualizations
  • Numerical Data
  • Categorical Data

Set Up the Notebook¶

In [1]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Why Do We Visualize Data¶

  • A large fraction of our brains are dedicated to visual reasoning.
  • In Data Science we use visualization:
    • For others – to communicate our findings
    • For ourselves – to understand our data, see patterns, and discover relationships

Demo: Identifying Data Type of Column Values¶

Load the actors.csv data. The 'Total Gross', 'Average per Movie', and 'Gross' values represent Thousands of Dollars

In [2]:
actors = Table().read_table('./data/actors.csv')
actors
Out[2]:
Actor Total Gross Number of Movies Average per Movie #1 Movie Gross
Harrison Ford 4871.7 41 118.8 Star Wars: The Force Awakens 936.7
Samuel L. Jackson 4772.8 69 69.2 The Avengers 623.4
Morgan Freeman 4468.3 61 73.3 The Dark Knight 534.9
Tom Hanks 4340.8 44 98.7 Toy Story 3 415
Robert Downey, Jr. 3947.3 53 74.5 The Avengers 623.4
Eddie Murphy 3810.4 38 100.3 Shrek 2 441.2
Tom Cruise 3587.2 36 99.6 War of the Worlds 234.3
Johnny Depp 3368.6 45 74.9 Dead Man's Chest 423.3
Michael Caine 3351.5 58 57.8 The Dark Knight 534.9
Scarlett Johansson 3341.2 37 90.3 The Avengers 623.4

... (40 rows omitted)

The actor's name is a categorical attribute.

In [5]:
# use type( ) to check the type of value
type(actors.column('Actor').item(0))
Out[5]:
str

The total gross dollar is a numerical attribute.

In [9]:
type(actors.column('Total Gross').item(0))
Out[9]:
float

Course Visualizations¶

  • In the course we will mostly use the following visualizations:
    • Histograms
    • Line Graphs
    • Scatter Plots
    • Bar Charts
  • You will indirectly work withe standard Matplotlib Python library for data visualization using the datascience library.
  • It may be helpful to overlay graphs to explore relationships.
  • How you visualize your data depends on attribute type.
  • The data type doesn't determine numerical/categorical attribute label.
    • '$12.00' is a str and likely to refelect a numerical attribute.

Numerical Data¶

Visualizing the Distribution of One Numerical Variable¶

Histograms tbl.hist are a standard way to visualize the distribution of one numerical variable.

Histograms will be focused on in the next lecture.

A Histogram¶

In [10]:
# tablename.hist('columnname', unit = "  ") to create a histogram
actors.hist('Total Gross', unit="Thousands of Dollars") 

# Some extra graph formatting you are not responsible for
plots.title('Distribution of Total Gross')
plots.show()

Plotting Two Numerical Variables¶

Line graphs tbl.plot and Scatter plots tbl.scatter are standard ways to visualize the relationship of two numerical variables.

A Line Graph¶

In [13]:
# using .group to count the number of movies released each year
top_movies = Table.read_table('./data/top_movies_2023.csv')
movies_per_year = top_movies.group('Year').relabeled('count', 'Number of Movies')
# movies_per_year.show()
# tablename.plot( 'x-axis', 'y-axis')
movies_per_year.where('Year', are.above(1999)).plot('Year', 'Number of Movies') 

plots.xticks(np.arange(2000, 2023, 5))
plots.title('Number of Movies vs. Release Year')
plots.show()

A Scatter Plot¶

In [14]:
# scatter plot tablename.scatter('x-axis', 'y-axis')
actors.scatter('Number of Movies', 'Average per Movie')

plots.title('Average Pay per Movie (Thousands of Dollars) vs. Number of Movies')
plots.show()

When to use a line vs scatter plot?¶

  • Use line plots for sequential data if:
    • ... your x-axis has an order
    • ... sequential differences in y values are meaningful
    • ... there's only one y-value for each x-value
  • Usually: x-axis is time or distance
  • Use scatter plots for non-sequential data --- When you’re looking for associations

Demo: Census¶

Explore the US Census data from the Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.

(Release date: June 2021, Updated January 2022 to include April 1, 2020 estimates)

In [15]:
url = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/asrh/nc-est2020-agesex-res.csv'
full = Table.read_table(url)
full
Out[15]:
SEX AGE CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 POPESTIMATE2014 POPESTIMATE2015 POPESTIMATE2016 POPESTIMATE2017 POPESTIMATE2018 POPESTIMATE2019 POPESTIMATE2020
0 0 3944153 3944160 3951495 3963264 3926731 3931411 3954973 3984144 3963268 3882437 3826908 3762227 3735010
0 1 3978070 3978090 3957904 3966768 3978210 3943348 3949559 3973828 4003586 3981864 3897917 3842257 3773884
0 2 4096929 4096939 4090799 3971498 3980139 3993047 3960015 3967672 3992657 4021261 3996742 3911822 3853025
0 3 4119040 4119051 4111869 4102429 3983007 3992839 4007852 3976277 3984985 4009060 4035053 4009037 3921526
0 4 4063170 4063186 4077511 4122252 4112849 3994539 4006407 4022785 3992241 4000394 4021907 4045996 4017847
0 5 4056858 4056872 4064653 4087770 4132349 4123745 4007123 4020489 4038022 4007233 4012789 4032231 4054336
0 6 4066381 4066412 4073031 4075153 4097860 4142923 4135738 4020428 4034969 4052428 4019106 4022432 4040169
0 7 4030579 4030594 4043100 4083399 4085255 4108453 4154947 4148711 4034355 4048430 4063647 4027876 4029753
0 8 4046486 4046497 4025624 4053313 4093553 4096033 4120476 4167765 4162142 4047130 4059209 4071894 4034785
0 9 4148353 4148369 4125413 4035854 4063662 4104437 4107986 4133426 4181069 4175085 4058207 4067320 4078668

... (296 rows omitted)

In the previous lecture, we did the following:

  • Select the SEX, AGE, CENSUS2010POP, and POPESTIMATE2019 columns.
  • Relabel the 2010 and 2019 columns.
  • Remove the 999 ages and focus just on the combined data where the SEX value is 0. Drop the SEX column since there is only one value there.
In [16]:
partial = full.select('SEX', 'AGE', 'CENSUS2010POP', 'POPESTIMATE2019')
simple = partial.relabeled(2, '2010').relabeled(3, '2019')
no_999 = simple.where('AGE', are.below(999))
everyone = no_999.where('SEX', 0).drop('SEX')
everyone
Out[16]:
AGE 2010 2019
0 3944153 3762227
1 3978070 3842257
2 4096929 3911822
3 4119040 4009037
4 4063170 4045996
5 4056858 4032231
6 4066381 4022432
7 4030579 4027876
8 4046486 4071894
9 4148353 4067320

... (91 rows omitted)

Visualize the relationship between age and population size in 2010.

In [20]:
#everyone.plot('AGE', '2010')
everyone.relabeled('2010', 'count').plot('AGE', 'count') 

plots.title('US Population Size in 2010') 
plots.show()

Include lines for both 2010 and the estimated 2019 population sizes.

In [21]:
# leave off 2010 label and it will plot for each column
everyone.plot('AGE')

plots.title('US Population Size') 
plots.show()

Demo: Male and Female 2019 Estimates¶

Create a table with Age, Males, Females columns showing the population estimates in 2019 for males and females by age.

In [23]:
# first, delete female rows to create males table
# then, delete male rows to create females table (each of these will have one row for each age)
# using the age column from the males table (or the females table), then paste on male and female counts

males = no_999.where('SEX', 1).drop('SEX')
females = no_999.where('SEX', 2).drop('SEX')
pop_2019 = Table().with_columns(
    'Age', males.column('AGE'),
    'Males', males.column('2019'),
    'Females', females.column('2019')
)
pop_2019
Out[23]:
Age Males Females
0 1921001 1841226
1 1963261 1878996
2 2000102 1911720
3 2048651 1960386
4 2068251 1977745
5 2063176 1969055
6 2055583 1966849
7 2058425 1969451
8 2082403 1989491
9 2075719 1991601

... (91 rows omitted)

Visualize the distribution of of population size for both males and females.

In [24]:
pop_2019.plot('Age')

plots.title('2019 Population Size Estimates')
plots.show()

Calculate the percent female for each age

In [26]:
# total number of people = males.column('2019') + females.column('2019')
pct_female = (females.column('2019') / (males.column('2019') + females.column('2019'))) * 100
pct_female
Out[26]:
array([ 48.93979018,  48.90344399,  48.87032181,  48.89917454,
        48.88153622,  48.83289177,  48.89701056,  48.89552211,
        48.85910586,  48.96592842,  48.98425388,  48.96313718,
        48.91848904,  48.91588355,  48.95682562,  48.99213593,
        49.00723665,  48.9917086 ,  48.94499775,  48.85555766,
        48.8800806 ,  48.89699809,  48.95129043,  48.84655675,
        48.77220901,  48.76311842,  48.68996749,  48.84567382,
        49.115004  ,  49.23311185,  49.27161137,  49.33570713,
        49.34690992,  49.39653681,  49.57328862,  49.7823678 ,
        49.88801204,  49.99258886,  50.08019625,  49.89892133,
        50.1409379 ,  50.20977831,  50.37327215,  50.36508359,
        50.27570341,  50.48253869,  50.64261911,  50.57544456,
        50.61870656,  50.44489454,  50.56911629,  50.63449931,
        50.80649435,  50.81894266,  50.89138769,  51.13627062,
        51.2696241 ,  51.37238838,  51.53410868,  51.46437873,
        51.72648051,  51.88456258,  52.09723728,  52.31329221,
        52.44314993,  52.76149769,  52.92230043,  53.03484444,
        53.26468499,  53.27081102,  53.40722561,  53.44223716,
        53.51022877,  53.9509406 ,  54.25448772,  54.58073446,
        54.83251151,  55.26819948,  55.82854715,  56.17047137,
        56.3748233 ,  57.03744511,  57.64539476,  58.2875019 ,
        59.12037315,  59.77448788,  60.61994754,  61.50555207,
        62.43469375,  63.42875214,  64.36264302,  65.56129226,
        66.59478489,  67.76493653,  69.03326813,  70.06426052,
        70.77789932,  72.11473518,  72.70429851,  74.48479938,  76.57254933])

Round the values to 3 decimal places so that it's easier to read.

In [28]:
pct_female = np.round(pct_female, 2)
pct_female
Out[28]:
array([ 48.94,  48.9 ,  48.87,  48.9 ,  48.88,  48.83,  48.9 ,  48.9 ,
        48.86,  48.97,  48.98,  48.96,  48.92,  48.92,  48.96,  48.99,
        49.01,  48.99,  48.94,  48.86,  48.88,  48.9 ,  48.95,  48.85,
        48.77,  48.76,  48.69,  48.85,  49.12,  49.23,  49.27,  49.34,
        49.35,  49.4 ,  49.57,  49.78,  49.89,  49.99,  50.08,  49.9 ,
        50.14,  50.21,  50.37,  50.36,  50.28,  50.48,  50.64,  50.58,
        50.62,  50.44,  50.57,  50.63,  50.81,  50.82,  50.89,  51.14,
        51.27,  51.37,  51.53,  51.46,  51.73,  51.88,  52.1 ,  52.31,
        52.44,  52.76,  52.92,  53.04,  53.26,  53.27,  53.41,  53.44,
        53.51,  53.95,  54.25,  54.58,  54.83,  55.27,  55.83,  56.17,
        56.38,  57.04,  57.64,  58.29,  59.12,  59.77,  60.62,  61.51,
        62.44,  63.43,  64.36,  65.56,  66.6 ,  67.76,  69.03,  70.06,
        70.78,  72.11,  72.7 ,  74.48,  76.57])

Add female percent to our table

In [29]:
pop_2019 = pop_2019.with_column('Percent Female', pct_female)
pop_2019
Out[29]:
Age Males Females Percent Female
0 1921001 1841226 48.94
1 1963261 1878996 48.9
2 2000102 1911720 48.87
3 2048651 1960386 48.9
4 2068251 1977745 48.88
5 2063176 1969055 48.83
6 2055583 1966849 48.9
7 2058425 1969451 48.9
8 2082403 1989491 48.86
9 2075719 1991601 48.97

... (91 rows omitted)

Visualize the relationship between age and the percent of the population that is female.

In [30]:
pop_2019.plot('Age', 'Percent Female')

plots.title('Female Population Percentage over Age')
plots.show()

Be careful of being visually mislead by the y-axis.

In [31]:
pop_2019.plot('Age', 'Percent Female')

plots.ylim(0, 100);
plots.title('Female Population Percentage over Age')
plots.show()

Demo: Scatter Plots¶

Visualize the relationship between the number of movies and the average pay per movie for each actor in the dataset.

In [32]:
# pull up actors data set again
# use .scatter
actors.scatter('Number of Movies', 'Average per Movie')

plots.title('Average per Movie (Thousands of Dollars) vs. Number of Movies')
plots.show()

Identify the outlier in the dataset.

In [34]:
# trying to find the movie with the average per movie above 400
# actors.sort('Average per Movie', descending=True)
# or 
actors.where('Average per Movie', are.above(400))
Out[34]:
Actor Total Gross Number of Movies Average per Movie #1 Movie Gross
Anthony Daniels 3162.9 7 451.8 Star Wars: The Force Awakens 936.7
In [35]:
# to find the value
max(actors.column('Average per Movie'))
Out[35]:
451.80000000000001
In [36]:
# to find actor
max_ave = max(actors.column('Average per Movie'))
actors.where('Average per Movie', max_ave).column('Actor').item(0)
Out[36]:
'Anthony Daniels'

For all the visualization methods we use from the datascience library, if you put an i infront of the name of the visualization, you can access an interactive version of plot that is based on another visualization library called Plotly. You will not be tested on your knowledge of these interactive plots. You might find them helpful for exploring the data.

In [37]:
actors.iscatter('Number of Movies', 
                'Average per Movie', 
                labels='Actor', 
                title='Average per Movie (Thousands of Dollars) vs. Number of Movies')

Categorical Data¶

  • (Horizontal) Bar charts barh are a standard way to visualize the distribution of a single categorical variable.
  • Pie charts are generally discouraged because most people have a difficult time visually interpreting angles compared to lengths of bars.

A Bar Chart¶

In [39]:
cones = Table().read_table('./data/cones.csv')
cones_grouped_by_flavor = cones.group('Flavor')
cones_grouped_by_flavor.barh('Flavor')

plots.title('Distrubtion of Ice Cream Flavors')
plots.show()

Demo: Bar Charts¶

The dataset top_movies_2023.csv shows the highest 1,000 grossing movies world wide listed on IMDB. Adjusted total gross values were also provided for data before 2021 using the Consumer Price Index (CPI)-based Python library cpi.

In [40]:
top_movies
Out[40]:
Created Modified Title URL Title Type IMDb Rating Runtime (mins) Year Genres Num Votes Release Date Directors Gross Gross (Adjusted)
2023-01-06 2023-01-06 Gone with the Wind https://www.imdb.com/title/tt0031381/ movie 8.2 238 1939 Drama, Romance, War 318271 1939-12-15 Sam Wood, George Cukor, Victor Fleming 402382193 7.84414e+09
2023-01-06 2023-01-06 Bambi https://www.imdb.com/title/tt0034492/ movie 7.3 69 1942 Animation, Adventure, Drama, Family 145676 1942-08-09 Samuel Armstrong, Paul Satterfield, Graham Heid, James A ... 267447150 4.44602e+09
2023-01-06 2023-01-06 Titanic https://www.imdb.com/title/tt0120338/ movie 7.9 194 1997 Drama, Romance 1187108 1997-11-01 James Cameron 2201647264 3.71701e+09
2023-01-06 2023-01-06 Avatar https://www.imdb.com/title/tt0499549/ movie 7.9 162 2009 Action, Adventure, Fantasy, Sci-Fi 1318546 2009-12-10 James Cameron 2922917914 3.69178e+09
2023-01-06 2023-01-06 Snow White and the Seven Dwarfs https://www.imdb.com/title/tt0029583/ movie 7.6 83 1937 Animation, Adventure, Family, Fantasy, Musical, Romance 202792 1937-12-21 William Cottrell, Ben Sharpsteen, David Hand, Perce Pear ... 184925486 3.47981e+09
2023-01-06 2023-01-06 Star Wars https://www.imdb.com/title/tt0076759/ movie 8.6 121 1977 Action, Adventure, Fantasy, Sci-Fi 1372821 1977-05-25 George Lucas 775398007 3.46716e+09
2023-01-06 2023-01-06 Avengers: Endgame https://www.imdb.com/title/tt4154796/ movie 8.4 181 2019 Action, Adventure, Drama, Sci-Fi 1144892 2019-04-22 Anthony Russo, Joe Russo 2797501328 2.96506e+09
2023-01-06 2023-01-06 The Exorcist https://www.imdb.com/title/tt0070047/ movie 8.1 122 1973 Horror 413376 1973-12-26 William Friedkin 441306145 2.69326e+09
2023-01-06 2023-01-06 Jaws https://www.imdb.com/title/tt0073195/ movie 8.1 124 1975 Adventure, Thriller 612946 1975-06-20 Steven Spielberg 476512065 2.40001e+09
2023-01-06 2023-01-06 Star Wars: Episode VII - The Force Awakens https://www.imdb.com/title/tt2488496/ movie 7.8 138 2015 Action, Adventure, Sci-Fi 936837 2015-12-14 J.J. Abrams 2069521700 2.36598e+09

... (990 rows omitted)

Since Gone with the Wind has been re-released several times, the adjusted price is not the most honest representation of its adjusted gross proces. For a more comparable analysis, reduce the table to the top top 10 movies based on actual gross values ('Gross (Adjusted)') for the movies releasted in the last decade.

In [42]:
# pare down the columns
top_movies_select = top_movies.select('Title', 'Year', 'Gross (Adjusted)')

# remove all movies from before 2012
top_movies_last_decade = top_movies_select.where('Year', are.above(2011))

# sort by gross (adjusted)
top_movies_last_decade_sorted = top_movies_last_decade.sort('Gross (Adjusted)', True)

# table method .take( ) with a range inside
top10 = top_movies_last_decade_sorted.take(np.arange(10))
top10
Out[42]:
Title Year Gross (Adjusted)
Avengers: Endgame 2019 2.96506e+09
Star Wars: Episode VII - The Force Awakens 2015 2.36598e+09
Avengers: Infinity War 2018 2.21039e+09
Spider-Man: No Way Home 2021 1.91631e+09
Jurassic World 2015 1.91099e+09
The Avengers 2012 1.79253e+09
The Lion King 2019 1.76269e+09
Fast & Furious 7 2015 1.73242e+09
Avengers: Age of Ultron 2015 1.60376e+09
Frozen II 2019 1.53688e+09

Convert to the gross (adjusted) values to billions of dollars for readability.

In [44]:
# take Gross (adjusted) and divide by 1 bill and round to 2 dps
billions = np.round(top10.column('Gross (Adjusted)') / 1000000000, 2)

# add the column to top10
top10 = top10.with_column('Gross Adjusted (Billions)', billions).select(0, 1, 3)
top10
Out[44]:
Title Year Gross Adjusted (Billions)
Avengers: Endgame 2019 2.97
Star Wars: Episode VII - The Force Awakens 2015 2.37
Avengers: Infinity War 2018 2.21
Spider-Man: No Way Home 2021 1.92
Jurassic World 2015 1.91
The Avengers 2012 1.79
The Lion King 2019 1.76
Fast & Furious 7 2015 1.73
Avengers: Age of Ultron 2015 1.6
Frozen II 2019 1.54

Visualize the gross adjusted values for each of the top 10 grossing (adjusted) movies.

In [45]:
# use .barh
top10.barh('Title', 'Gross Adjusted (Billions)')

plots.title("The Top 10 Grossing Movies")
plots.show()

Visual Perception Accuracy¶

From Nathan Yau’s Data Points: Visualization that Means Something, our eyes can extract information at different levels of accuracy depending on the design.


Adopted from UC Berkeley DATA 8 course materials.

This content is offered under a CC Attribution Non-Commercial Share Alike license.